DSCI 552 PS 1 - Used Car Dataset

EDA / cleaning / transforming step

Your CEO said: “The dataset describes conditions of various used cars and their current prices. I would like to learn what drives prices of used cars.

  1. Look at the dataset and find the main factors that affect the value of a car – and then explain it to me.

  2. Additionally, assess the impact of some special modifications (denoted by F1, F2, F3 and F4 in your dataset) on the price. This would help us to understand, if we should make the modifications before selling a car or not. I would like to see the report, describing your main findings, on my desk, on Thursday, February 11, 2021 at 10 A.M. “

Hint: You are asked to find general trends in the data. Report whatever you think is the most important. Your CEO doesn’t want to see a list that is 20-times long. She would like to learn just about some general trends. To give you an example, one general trend could be “The price decrease with the age of the car. Holding all other factors constant, with each year, the price of a car decreases by \$570. However, these dynamics are not constant. Value of younger cars decreases faster than the value of an old car. For example, the value of cars that are less than 5 years old, decreases nearly $2,500 per year.” (This is just an example; your numbers might be different). Your second task you have to check both, the impact and the statistical significance of the F1-F4 attributes for making the price predictions.

Modeling step

Your Technical Manager said: “I would like you to propose a predictive model, that can be used to determine price of a used car. The problem is that the state-law demands that this model be easily interpretable. It means that we are restricted to use simple methods like Linear Regression, Ridge Regression, LASSO and Elastic Net. Additionally, we need to know how accurate the model is. You must choose the best model and report its root mean square error. Describe everything in your report and I will study it carefully”. Hint: In the most typical approach, you need to build three datasets: a training set, a validation set and a test set. You will use validation set to determine the best model; the test set to estimate model accuracy. In your report you should describe how you trained the models, how you selected the best one and how you tested its performance at the end.

The Senior Developer took you aside and said: “My task is to deploy your model to production. But I cannot deploy a paper-report. I need your code. However, remember that I am not a Data Scientist list you. I have a different expertise. I will read your code, but you should make sure that I can follow and understand it – and that I know how to use it.”

Hint: In the ideal case, people should be able to take your code, run it and recreate all your results. In a less ideal case, it should be a demonstration of typical run. The code should demonstrate your approach end-to-end. People should just specify the path to the dataset, run it and see final results. Another name for this is a technical demo. At your future work, you might be quite often asked to demo your results. People will expect you to present an end-to- end example where you read the raw data, train your model and evaluate the results of the predictions.